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Abstract. Many of the currently popular “block algorithms” are scalar algorithms in which the 
operations have been grouped and reordered into matrix operations. One genuine block algorithm 
in practical use is block LU factorization, and this has recently been shown by Demmel and Higham 
to be unstable in general. It is shown here that block LU factorization is stable if A is block 
diagonally dominant by columns. Moreover, for a general matrix the level of instability in block 
LU factorization can be bounded in terms of the condition number k(A) and the growth factor for 
Gaussian elimination without pivoting. A consequence is that block LU factorization is stable for 
a matrix A that is symmetric positive definite or point diagonally dominant by rows or columns as 
long as A is well-conditioned. 

Key words: block algorithm, LAPACK, level 3 BLAS, iterative refinement, LU factorization, 
backward error analysis, block diagonal dominance. 

AMS(MOS) subject classifications, primary 65F05, 65F25, 65G05. 


1 Introduction 

Block methods in matrix computations are widely recognised as being able to achieve high perfor- 
mance on modern vector and parallel computers. Their performance benefits have been investigated 
by various authors over the last decade (see, for example, [11, 14, 15]), and in particular by the 
developers of LAPACK [1]. The rise to prominence of block methods has been accompanied by the 
development of the level 3 Basic Linear Algebra Subprograms (BLAS3) — a set of specifications of 
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Fortran primitives for various types of matrix multiplication, together with solution of a triangular 
system with multiple right-hand sides [9, 10]. Block algorithms can be cast largely in terms of calls 
to the BLAS3, and it is by working with these matrix-matrix operations that they achieve high 
performance. (For a detailed explanation of why matrix-matrix operations lead to high efficiency 
see [8] or [16].) 

While the performance aspects of block algorithms have been thoroughly analyzed, numerical 
stability issues have received relatively little attention. This is perhaps not surprising, because most 
block algorithms in practical use automatically have excellent numerical stability properties. Indeed, 
Demmel and Higham [7] show that all the block algorithms in LAPACK are as stable as their point 
counterparts. However, stability cannot be taken for granted. LAPACK includes a block algorithm 
for inverting a triangular matrix that is a generalization of a standard point algorithm. During the 
development of LAPACK another, equally plausible block generalization was considered — this one 
was found to be unstable [12]. 

In this work we investigate the numerical stability of a block form of the most important of 
all matrix factorizations, LU factorization. What we mean by “block form” needs to be explained 
carefully, since the adjective “block” has more than one meaning in the literature. We will use the 
following terminology, which emphasises an important distinction and leads to insight in interpreting 
stability results. 

A partitioned algorithm is a scalar (or point) algorithm in which the operations have been grouped 
and reordered into matrix operations. The partitioned form may involve some extra operations over 
the scalar form (as is the case with algorithms that aggregate Householder transformations using 
the WY technique of [4]). 

A block algorithm is a generalization of a scalar algorithm in which the basic scalar operations 
become matrix operations (a + /?, a/?, a/P become A + 5, AB and AB ~ l ), and a matrix property 
based on the nonzero structure becomes the corresponding property biockwise (in particular, the 
scalars 0 and 1 become the zero matrix and the identity matrix, respectively). A block factorization 
is defined in a similar way, and is usually what a block algorithm computes. 

The distinction between a partitioned algorithm and a block algorithm is rarely made in the 
literature (an exception is the paper [24]). The term “block algorithm” is frequently used to describe 
both types of algorithm. A partitioned algorithm might also be called a “blocked algorithm” (as is 
done in [8]), but the similarity to “block algorithm” can cause confusion and so we do not recommend 
this terminology. Note that in the particular case of matrix multiplication partitioned and block 
algorithms are equivalent. 

LAPACK contains only partitioned algorithms. A possible exception is the multi-shift Hessenberg 
QR iteration [2], which could be regarded a block algorithm, even though it does not work with a 
block Hessenberg form. As this example indicates, not all algorithms fit neatly into one class or the 
other, so our definitions should not be interpreted too strictly. 
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Block LU factorization is one of the few block factorizations in practical use. It takes the form 



An A \2 A\z 


■ 1 


'U n Ui 2 U 13 ‘ 

A = 

Mi M 2 Ms 

= 

l 2 i 1 


U 22 U 22 


m Mi A 32 A 33 _ 


m Lz\ L 32 I 


C/33. 


where, for illustration, we are regarding A as a block 3x3 matrix. L is block lower triangular with 
identity matrices on the diagonal (and hence is lower triangular), and U is block upper triangular 
(but the diagonal blocks Uu are not triangular, in general). 

Block LU factorization has been discussed by various authors; see, for example, [5, 15, 23, 24]. 
It appears to have first been proposed for block tridiagonal matrices, which frequently arise in the 
discretization of partial differential equations [16, Sec. 4.5.1], [21, p. 59], [22], [26]. An attraction 
of block LU factorization is that one particular implementation has a greater amount of matrix 
multiplication than conventional LU factorization (see section 2), and this is likely to make it more 
efficient on high-performance computers. 

By contrast with (1.1), a standard LU factorization can be written in the form 



'Ln 


U a U 12 U 13 

A = 

L 21 L 22 


U 22 u 23 


_£ 31 L32 £33. 


Uzs m 


where L is unit lower triangular and U is upper triangular, A partitioned version of the outer 
product LU factorization algorithm (without pivoting) computes the first block column of L and 
the first block row of U as follows. An = L\\U \\ is computed as a point LU factorization, and 
the equations LnUn = An and LnUu = An are solved for La and U 1 = 2,3. The process is 
repeated on the Schur complement, 


' Ao 2 A?s 
. A32 A33 , 



^13] • 


This algorithm does the same arithmetic operations as any other version of standard LU factoriza- 
tion, but in a different order. 

Demmel and Higham [7] have recently shown that block LU factorization can be unstable, even 
when A is symmetric positive definite or diagonally dominant by rows. This instability had previously 
been identified and analysed in [3] in the special case where A is a particular row permutation of 
a symmetric positive definite block tridiagonal matrix. The purpose of this work is to gain further 
insight into the instability of block LU factorization. We also wish to emphasise that of the two 
classes of algorithms we have defined it is the block algorithms whose stability is most in question. 
We know of no examples of an unstable partitioned algorithm. (Those partitioned algorithms based 
on the aggregation of Householder transformations that do slightly different arithmetic to the point 
versions have been shown to be stable [4, 7]). 

In section 2 we derive backward error bounds for block LU factorization and for the solution of a 
linear system Ax = b using the block LU factors. In section 3 we show that block LU factorization is 
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stable if A is block diagonally dominant by columns; this generalizes the known results that Gaussian 
elimination without pivoting is stable for column diagonally dominant matrices [28] and that block 
LU factorization is stable for block tridiagonal matrices that are block diagonally dominant by 
columns [26], We also show that for a general matrix A the backward error is bounded by a product 
involving /c(A) and the growth factor p n for Gaussian elimination without pivoting on A. If A is 
(point) diagonally dominant this bound simplifies because p n <2. If A is diagonally dominant by 
columns we show that a potentially much smaller bound holds that depends only on the block size. 
In section 4 we specialize to symmetric positive definite matrices and show that the backward error 
can be bounded by a multiple of * 2 (A) 1 / 2 . Block LU factorization is thus conditionally stable for 
symmetric positive definite and diagonally dominant matrices: it is guaranteed to be stable only if 
A is well-conditioned. Results of this type are rare for linear equation solvers based on factorization 
methods, although stability results conditional on other functions of A do hold for certain iterative 
linear equation solvers [20, 29]. 

In section 5 we present some numerical experiments that show our error bounds to be reasonably 
sharp and reveal some interesting numerical behaviour. Concluding remarks are given in section 6. 


2 Error Analysis of Block LU factorization 


We consider a block LU factorization A = LU E IR nxr \ where the diagonal blocks in the partitioning 
are square but do not necessarily all have the same dimension. 

If An € IR rxr is nonsingular we can write 


Mi An _ I 0 An M 2 
.Aii An. .L 11 /. . 0 S . 


(2.1) 


which describes one block step of an outer product based algorithm for computing a block LU 
factorization. Here, 5 = An — AnA^ An is a Schur complement of A. If the (1,1) block of S of 
appropriate dimension is nonsingular then we can factorize 5 in a similar manner, and this process 
can be continued recursively to obtain the complete block LU factorization. The overall algorithm 
can be expressed as follows. 


Algorithm BLU. 

This algorithm computes a block LU factorization A = LU 6 IR axn . 

1. Un = An , Un — An • 

2. Solve LnAn = An for Ln. 

3. 5 = An — LnAn (Schur complement). 

4. Compute the block LU factorization of 5, recursively. 


Given the block LU factorization of A, the solution to a system Ax = 6 can be obtained by 
solving Lx = y by forward substitution (since L is triangular) and solving Ux = y by block back 
substitution. There is freedom in how step 2 of Algorithm BLU is accomplished, and how the linear 
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systems with coefficient matrices Uu that arise in the block back substitution are solved. The two 
main possibilities are as follows. 

Implementation 1: An is factorized by Gaussian elimination with partial pivoting (GEPP). 
Step 2 and the solution of linear systems with Uu are accomplished by substitution with the LU 
factors of An, 

Implementation 2: Af^ is computed explicitly, so that step 2 becomes a matrix multiplication 
and Ux = y is solved entirely by matrix- vector multiplications. This approach is attractive for 
parallel machines [15, 24]. 

We now give an error analysis for Algorithm BLU, under the following model of floating point 
arithmetic, where u is the unit roundoff: 

fl(x±y) = z(l + a) ±y(l + /?), |a|, \(3\ < u, 

fl(xopy) = (xopy)(l + $), |$| < u, op = *,/. 

It is convenient to use the matrix norm defined by 

\\A\\ = max|a, 7 |. (2.2) 

Note that if A € IR m * n and B € IR nxp then ||.AB|| < n||j4||||B|| is the best such bound; this 
inequality affects some of the constants in our analysis and will be used without comment. 

We assume that the computed matrices Z -21 from step 2 of Algorithm BLU satisfy 

L 21 A n = A 21 + E n , ||£ ai || < c n «||Z 21 ||||An|| + 0(u 2 ), (2.3) 

where c n denotes a constant depending on n (we are not concerned with the precise values of the 
constants in this analysis). We also assume that when a system Uuxi = d{ is solved, the computed 
solution Xi satisfies 


(Uu + A Uii)x { = rf if \\AUuW < c'^WUuW + 0(u 2 ). (2.4) 

The assumptions (2.3) and (2.4) are satisfied for implementation 1 and are sufficient to prove the 
following result. 

Theorem 2.1 Let L and U be the computed block LU factors of A € IR nxn from Algorithm BLU, 
and let x be the computed solutton to Ax = 6. Under assumptions (2.3) and (2.4), 

LU = A + E, ||£7f| < rf««(||A|| + ||Z|||!£/||) -4- C»(« 2 ), (2.5) 

(• A + AA)x = 6, ||A>l||<d>(|M|| + P||TO + 0(u 2 ). (2.6) 

Proof. Standard error analysis for matrix multiplication [16, p. 66] shows that in step 3 of the 
first block stage of the factorization, 

S = Aii - £ 21^12 + &S, ||AS|| < c„u(||i422|| + + 0(u 2 ). 
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The remaining stages of the factorization compute the block LU factorization S « LsUs , which, 
inductively, we can assume satisfies 

LsUs = S + E s , ||£s|| < c;«(||5|| + ps||||£s||) + 0(u 2 ). 

Using (2.3) we have 



and so 

I \A - LU\\ < c"u max(||L 21 |||Mu||, ||A 22 || + ||£ 2 i|||M 12 || + ||S|| + |M|||t/s||) + 0(u 2 ) 

< d nU (||>i|| + ||2||||c/||) + o( u 2 ). 

The system Ax = b is solved via Ly = b and Ux = y. Since L is triangular we have from a 
standard result [16, sec. 3.1] that 

( L + AZ)y = b , \\AL\\<c n ii\\L\\ + 0(u 2 ). (2.7) 

For Ux = y consider the first block row, which can be written 

U\\x i = y - U i2 X 2 . 

In the last stage of block back substitution £2 is known and this equation is solved for xi. Accounting 
for the error in forming the right-hand side, and invoking (2.4), we have 

(Un + A£/n)2 x = y- (U l2 + A 0 l2 )x 2 , ||A£7y|| < c>||t/ 1; || + 0( u 2 ). 

Since analogous equations hold for all the block rows, we have, overall, 

(U + A U)x = y, \\AU\\ < c' n u\\U\\ + 0(u 2 ). (2.8) 

Combining (2.5), (2.7) and (2.8) we have 

b = (L + AL)(£/ + AU)x 

= (A + E + AZf/ 4- LAU + A2Atf)£ 

= (A + AA)x, 

and ||AA|| is bounded as in (2.6). I 

Theorem 2.1 shows that the stability of block LU factorization is determined by the ratio 
||L||||£?||/||j 4|| (the sharpness of the bounds is demonstrated in the numerical experiments of sec- 
tion 5). If this ratio is reasonably bounded, by a modest function of n, say, then L and U are 
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the true factors of a matrix close to A , and x solves a slightly perturbed system. It was noted in 
[7] that ||L||||f/|| can exceed \\A\\ by an arbitrary factor, even if A is symmetric positive definite or 
diagonally dominant by rows. Indeed, |[L|| > ||^ 2 i[| = 11^21-4 7/ II > using the partitioning (2.1), and 
this lower bound for ||L|| can be arbitrarily large. In the following two sections we investigate this 
instability more closely and show that ||L||||Cf|| can be bounded in a useful way for particular classes 
of A. Without further comment we make the reasonable assumption that ||L||||(/|| « ||L||||f/||, so 
that these bounds may be used in Theorem 2.1. 

We mention that the bounds in Theorem 2.1 are valid also for other version of block LU factor- 
ization obtained by “block loop reordering”, such as a block gaxpy based algorithm [16, p. 101]. 

Finally, we comment on implementation 2. Suppose, for simplicity, that the inverses A^ (which 
are used in step 2 of Algorithm BLU and in the block back substitution) are computed exactly. 
Then the best bounds of the forms (2.3) and (2.4) are 

T 21 A 11 = A2i+i?2i, ||^2i|| £ c n u/c(An)||A2i|| + 0(u 2 ), 

(Uu Hr Af 7u)2i = d iy ||At/»i|| < c^k^OII^II + 0(u 2 ). 

Working from these results, we find that Theorem 2.1 still holds provided the first order terms in 
the bounds in (2.5) and (2.6) are multiplied by max,* k(Uh). This suggests that implementation 2 of 
Algorithm BLU can be much less stable than implementation 1 when the diagonal blocks of U are 
ill-conditioned, and this is confirmed by the numerical results in section 5. 

3 Diagonal Dominance 

One class of matrices for which block LU factorization has long been known to be stable is block 
tridiagonal matrices that are block diagonally dominant. A general matrix A E IR nxn is block 
diagonally dominant by columns, with respect to a given partitioning A.= (Aij) and a given norm, 
if, for all j y 

ii^ 1 ir 1 -EiKii = T,>o. ( 3 . 1 ) 

A is block diagonally dominant by rows if A T is block diagonally dominant by columns. For the block 
size 1 the usual property of point diagonal dominance is obtained. Note that for the 1 and oo-norms 
diagonal dominance does not imply block diagonal dominance, nor does the reverse implication hold. 

Block diagonal dominance was introduced in [13], and has mostly found use in generalizations 
of the Gershgorin circle theorem. However, Varah [26] proved that if A is block tridiagonal and has 
the block LU factorization A = LU (so that L and U are block bidiagonal and Ui^+x = A,* >t+ 1 ), 
then if A is block diagonally dominant by columns 

||Lm-iII< 1. Il%ll<ll^«|| + ||i4<-.i i i|| > (3.2) 

while if A is block diagonally dominant by rows 

Pm- ill < IP.'ill < Uu\\ + Pm-i||. (3.3) 


7 



Here, the norm is assumed to be a subordinate matrix norm. For the oo-norm the inequalities 
(3.2) imply that ||L||oo < 2 and ||C/||oo < 3||>l||oo, so block LU factorization is stable if A is block 
diagonally dominant by columns. Similarly, if A is block diagonally dominant by rows we have 
stability if ||>lt t i-i||/||j4,‘-M|{ is suitably bounded for all i. 

Varah’s results can be extended to full, block diagonally dominant matrices, as we now explain. 
First, we show that for such matrices a block LU factorization exists, using the key property that 
block diagonal dominance is inherited by the Schur complements obtained in the course of the 
factorization. In the following analysis we assume that A has m block rows and columns. 

Lemma 3.1 Suppose A 6 IR nxn is nonsingular and block diagonally dominant by rows or columns 
with respect to a subordinate matrix norm in (3.1). Then A has a block LU factorization, and all 
the Schur complements arising in Algorithm BLU have the same kind of diagonal dominance as A. 


Proof. The proof is a generalization of the corresponding result for point diagonal dominance 
[16, p. 20], [28]. We consider the case of block diagonal dominance by columns; the proof for row-wise 
diagonal dominance is analogous. 

Let 


'U 11 U 12 
. o 5 

denote the matrix obtained from A after one step of Algorithm BLU. For 2 < j < n we have 


(S3 

•*> 


Dl4 2) H = 

»ss3 

< e ik 11+ 11^1111^11 £ ii a, ,11 

»=2 i=2 

< E IKII + ll^ullll^n ll(Mn 1 r 1 - Mill), using (3.1), 

i = 2 

= ElKII + Miill-KIIMnllKill 


t=2 


< llVir'-KIIIMnllMiill. using (3.1), 

= min \\Ajjx\\ - ||Ai i ||||An l ||||A ;1 || 

ll*ll=i 

< min ||(A j7 - Aj\Aii A\j)x\\ 

ll*ll=i 

= min llA^atll. 

11*11=1 31 


(3.4) 


Now if AjV is singular it follows that ll^y^ll = therefore A^ 2 \ and hence also A, is 

singular, which is a contradiction. Thus A is nonsingular, and (3.4) can be rewritten 


e Mi/’ii < psr'i 


1-1 


i=2 

i*3 
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showing that A is block diagonally dominant by columns. The result follows by induction. | 
The next result allows us to bound \\U\\ for a block diagonally dominant matrix. 


Lemma 3.2 Let A satisfy the conditions of Lemma 3.1 . If A^ denotes the matrix obtained after 
k — 1 steps of Algorithm BLU , then 

.max H^ll < 2 max ||j4 ( /||. 

Proof. The proof is a straightforward generalization of Wilkinson’s proof of the corresponding 
result for point diagonally dominant matrices [28, pp. 288-289]. Let A be block diagonally dominant 
by columns (the proof for row diagonal dominance is similar). Then 

m m 

Eh4 2) h = 

»=2 i =2 

m m 

< £ ||A <y || + \\A u \\Un l II Ell^iH 

1=2 i=2 


*=i 


using (3.1). By induction, using Lemma 3.1, it follows that ||A^|| < This yields 


max ||A^|| < max V' \\A^\\ < max Y"* ||v4 t -y||. 

k<i,j <m lJ 11 “ k<j<m^ 11 11 “ k<j<m ^ 

~ ~ i = ib ~ ~ x = l 

i-From (3.1), Moll < Mf/ll -1 < Mj;ll> so 

PifllS 2 Mijll < 2 m ax ||ily/|| = 2 max M<;| 

k<i,j<m J k<j <m l<j<m 1 <tj <m 


The implications of Lemmas 3.1 and 3.2 for stability are as follows. Suppose A is block diagonally 
dominant by columns. Also, assume that the norm has the property that 


max M«;ll < ||4|| < £]M*;II> (3.5) 

which holds for any p-norm, for example. Then Lemma 3.1 implies that • • • > ^m>] T || < 1 

for each subdiagonal block column of L, and since Uij = for j > z, Lemma 3.2 shows that 
IlCfyll < 2||A|| for each block of U. Therefore ||L|| < m and \\U\\ < m(m + 1)||A||, and so ||L||||£/|| < 
m 2 (m -f 1)||A||. For particular norms the bounds on the blocks of L and U yield a smaller bound 
for ||L|| and \\U\\. For example, for the 1-norm we have ||L||i||{/||i < 2m||A||i and for the oo-norm 
||L||oo||£/||oo < 2m 2 ||A||oo- We conclude that block LU factorization is stable if A is block diagonally 
dominant by columns with respect to any subordinate matrix norm satisfying (3.5). 
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Unfortunately, block LU factorization can be unstable when A is block diagonally dominant by 
rows. For although Lemma 3.2 guarantees that < 2||A||, ||Z,|| can be arbitrarily large. This 

can be seen from the example 



where A is block diagonally dominant by rows in any subordinate norm for any nonsingular matrix 
An. ^ is easy to confirm numerically that block LU factorization can be unstable on matrices of 
this form. Note that if the block size is 1 then we do have stability, since block LU factorization is 
equivalent to Gaussian elimination (GE) and the growth factor is bounded by 2 [28] (see Lemma 3.2). 

It is also of interest to bound ||L||||f/|| for a point diagonally dominant matrix, since this property 
is much easier to check than block diagonal dominance. We will derive a bound for ||L||||[/|| for a 
general matrix and then specialise to diagonal dominance. We partition A according to 

a =\ A a U t 2 l> ^nSlR rxr . (3.6) 

An. 

In the rest of this section we use the norm (2.2) and p n denotes the growth factor for Gaussian 

/ L \ 

elimination (GE) without pivoting, that is, p n = maxjj,* |/|a t; | in the usual notation. We 
assume that GE applied to A succeeds. 

Lemma 3.3 If A £ IR nxn then ||^2i^n 1 |l 5: n Pn^(A). 

Proof. ^From (2.1) it can be seen that (A~ l ) 2 i = — S -1 AnA^ , where the Schur complement 
S =■ An *” AnA^i A\i. Hence 

\\A 21 Af 1 W < nj|S||||(^4 _1 )2i|j < n II^IIIM _ l ||* 

S is the trailing submatrix that would be obtained after r — 1 steps of GE. It follows immediately 
that ||5|| < p n \\A\\. I 

Lemma 3.4 If A £ ]R nxn then the Schur complement S = An — AnA^ An satisfies k (5) < 

Pn*(A). 

Proof. j|5|| < pnWAW, as noted in the proof of Lemma 3.3, and ||5 -1 || < ||^4” l || because 5“ l is 
the (2, 2) block of A" 1 , as is easily seen from (2.1). I 

To bound ||L|| note that, under the partitioning (3.6), for the first block stage of Algorithm BLU 
we have ||L 2 i(| “ ll^ 2 i^n 1 || < np n K(A) by Lemma 3.3. Since the algorithm works recursively with 
the Schur complement 5, and since k(S) < p n K(A) (by Lemma 3.4), each subsequently computed 
subdiagonal block of L has norm at most np^n(A). Since U is composed of elements of A together 
with elements of Schur complements of A , \\U\\ < /? n ||A||. Overall, for a general matrix A £ IR rtXn , 

IWini < npl K (A) ■ p„\\A\\ = np®«(ii)||i4||. 
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Thus block LU factorization is stable for a general matrix A as long as GE is stable for A (that is, 
pn = 0(1)) and A is well-conditioned. 

If A is diagonally dominant by rows or columns then p n < 2 [28], as noted above. Hence for a 
diagonally dominant matrix A , 

^Hilt'll < 8»«(il)||i4||, (3.7) 

that is, stability depends only on the condition of A . 

The upper bound in Lemma 3.3 gives about the best possible bound for row diagonally dominant 
matrices, but a potentially much smaller bound holds under column diagonal dominance, as we now 
explain. Consider the standard LU factorization, 



where A\\ E IR rxr . Equating this factorization with (2.1), we see that 

L21 = i42i^4 11 1 = A21U n L n = L 2 iL lx . 

If A is diagonally dominant by columns then the multipliers for GE are all bounded by 1 in absolute 
value (from Lemma 3.1 with block size 1), or, equivalently, no row interchanges are required by 
partial pivoting. This implies that ||Loi|| < 1 and ||L U 1 || < 2 r ~ 2 , and so 

Pull < n2 r - 2 . (3.9) 

The bound for ||L u l || is not attainable, for if a unit lower triangular T 6 IR nxn satisfies ||T , ~ 1 || = 2 n ~ 2 
then t{j = — 1 for all i > j } and this implies that the first column of A = TU is Un(l, — 1, — 1,...,— l) T , 
so A is not diagonally dominant by columns. In any case, ||£ 1 j 1 || is typically 0(r) in practice, assum- 
ing only that partial pivoting requires no row interchanges (A 2 PA) (and thus not fully exploiting 
the diagonal dominance) [25]. No such bound as (3.9) holds for row diagonal dominance, because in 
this case there is no a priori bound on the multipliers. 

For a column diagonally dominant matrix (3.7) and (3.9) give 

PIIIPII < 2n min(2 r-2 , 4/c(j4))|| j 4|| , (3.10) 

where the maximum block size is r. 

To summarise, for (point) diagonally dominant matrices stability is guaranteed if A is well- 

conditioned. This in turn is guaranteed if the block column diagonal dominance amounts 7 j in 

(3.1) are sufficiently large relative to ||v4||, because for the oo-norm and any block sizes in (3.1), 
\\A~ ^li < (minj 7 j)" 1 [27]. In the case of column diagonal dominance, stability is guaranteed for 
small block sizes r irrespective of k {A) } by (3.10). 


4 Symmetric Positive Definite Matrices 


Let A be a symmetric positive definite matrix, partitioned as 

An A 01 

A= 4 l / , An 6 1 

./I21 A22. 
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The definiteness implies certain relations among the submatrices A i; - that can be used to obtain a 
stronger bound for ||L||2 than can be deduced from Lemma 3.3. 

Lemma 4.1 If A is symmetric positive definite then HA21A7/ lb < «2 (A) l/2 . 


Proof. Let A have the Cholesky factorization 


f?l'l 

0 1 

r #12 

.RL 

1 

r W 

oT 

o 


Then A?iA n x = R% 2 R n ■ f? n 1 f2J" 1 T = R^ 2 R l ^ , so 


R u €!R rxr . 


\\AiiAu% < IMHKi 1 Ih < Il^ll2||^- l ||2 = k,(R) = *2 (A) 1 '*. 


Note : At the cost of a much more difficult proof, Lemma 4.1 can be strengthened to the attainable 
bound IIA21 A11W2 < (^(A) 1 ^ 2 — K2(A) -1 / 2 )/2, as shown in [6, Theorem 4], but the weaker bound 
is sufficient for our purposes. 

The proof of the following lemma is similar to that of Lemma 3.4. 

Lemma 4.2 If A is symmetric positive definite then the Schur complement S = A22 — ^21^1x^21 
satisfies K2 (5) < *2 Ca- 
using the same reasoning as in the last section, we find that each subdiagonal block of L is 
bounded in 2-norm by /C2 (A) 1 / 2 . Therefore ||L||2 < 1 + m«2( J 4) 1 ^ 2 , where there are m block stages 
in the algorithm. Also, it can be shown that \\U\\2 < >/m||A||2. Hence 

\\Lh\\U\\ 2 < Vm(l + mK 2 (A) l ' 2 )p|| 2 . (4.1) 

It follows from Theorem 2.1 that when Algorithm BLU is applied to a symmetric positive definite 
matrix A, the backward errors for the LU factorization and the subsequent solution of a linear 
system are both bounded by 

c„\/mu||A|| 2 (2 + m/c 2 (A) 1/2 ) + 0(u 2 ). (4.2) 

Any resulting bound for ||x — x|| 2 /||x|| 2 will be proportional to /c 2 (A) 3 ^ 2 , rather than k 2 (A) as for a 
stable method. This suggests that block LU factorization can lose up to 50% more digits of accuracy 
in x than a stable method for solving symmetric positive definite linear systems. 

Note that the «2 (A) 1 / 2 term in (4.2) can be pessimistic, because it is clear from the proof of 
Lemma 4.1 that it is terms ||{/i7 1 |l2^ 2 > where Ua is a diagonal block of U , that influence the error 
bounds, and ||J/ f 7 l ||2^ 2 < HA” 1 )!^ 2 . One would expect the backward error to increase with the block 
size, with a backward error of size (4.2) being nearly attainable for a sufficiently large block size. 

Our main conclusion is that block LU factorization is guaranteed to be stable for a symmetric 
positive definite matrix A if A is well-conditioned. 
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One might wonder whether block LDL T and block Cholesky factorizations have better stability 
properties than block LU factorization. A genuine block Cholesky factorization A = R T R would 
use matrix square roots (R\\ = A x [ , etc.), which makes this factorization too expensive, whereas 
a partitioned Cholesky factorization is numerically equivalent to the point case. A block LDL T 
factorization, where D = diag(Du> . . . , D mm ), is feasible to compute, but it is easily shown to have 
analogous stability properties to block LU factorization. 


5 Numerical Experiments 


We describe some numerical experiments that give further insight into the analysis presented above. 
The computations were performed in MATLAB, which has unit. roundoff u = 2“ 53 « 1.1 x 10~ 16 . 

We use the following two matrices, which were also used in [7]. The symmetric positive definite 
Moler matrix [19] is defined by A n ( a) = Rn(a) T R n (a), where Rn(oc) is unit upper triangular 
with all the off-diagonal elements equal to c*. The Dorr matrix D n (a) [19], is an unsymmetric, 
row diagonally dominant tridiagonal matrix. D n (a) has row diagonal dominance factors := 
|d,»| — — |d*,i+i| = (n + 1 ) 2 a for i = 1, n and ji = 0 otherwise, and we perturbed the diagonal 

elements dri, . . . , d n -i rn -i to ensure that 7,* > 10” 14 for the computed matrix. Neither of these two 
matrices is row or column block diagonally dominant for any block sizes in the 1, 2 and 00 norms. 

In the first experiment we chose x = e = (1, 1, . . . , 1) T , formed 6 = Ax and solved for x using 
Algorithm BLU with implementations 1 and 2. One step of iterative refinement in fixed precision 
was done, yielding a corrected solution y. We report the relative residuals 


res(L, U) 


WA-LUWcq 

Halloo 


res(J) = 


||i4S — 6|| e 


IMIlc 


+m\ 


and the forward error 


err(x) = 


\x - x 


Note that res(x) is the normwise backward error of x (see, e.g., [17]), and that, approximately, 
err(x) < Koo(A)res(x). We also report the upper bounds for res(Z, U) and res(x) from Theorem 2.1, 
which modulo the constant terms are both approximately 


bound 1 = 


«l|2|Ul£||oo 

IMIloo 


the corresponding bound for implementation 2 is 


bound 2 = max/Coo(i/ti)boundi. 

The results for the Moler matrix A\$(—2) are shown in Tables 5.1 and 5.2. Note that we know 
the exact solution x because A has integer entries with K-l < 61 and so 6 = Ax is formed exactly. 
We comment on several interesting features. 

(1) For implementation 1 instability is revealed in the residuals for both the factorization and for 
x; it increases with the block size, as is to be expected (see the discussion at the end of section 4). 
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The values for boundi show that the theoretical error bounds correctly model the variation of the 
residuals with the block size and are mostly within two orders of magnitude of the actual residuals. 

(2) Implementation 2 is much more unstable than implementation 1 as a means of computing 
the block LU factorization. The residuals of the computed solutions x are as small as for imple- 
mentation 1 but the forward errors are mostly larger. The quantity bound 2 is very pessimistic as 
an estimate of the residuals. 

(3) One step of iterative refinement works extremely well for implementation 1, but it is ineffective 
for most block sizes with implementation 2. Theoretical backing for iterative refinement in fixed 
precision can be given using Theorem 2.1 together with Theorem 2.1 of [18]; see the discussion in 
section 2.2 of [7]. For implementation 2 the instability is too severe for iterative refinement to work. 

(4) The forward errors for y in implementation 1 reflect the ill-condition of the problem. It is 
not clear why the forward errors for x are no larger than those for the “more stable” solution y. 

(5) For the block size 15 (m = 2) with implementation 1, 

||I|| 2 ||£|| 2 * 3 x 10 9 a 0.05« 2 (A) 1 / 2 ||^|| 2 , 
which shows that (4.1) is reasonably sharp. 

We solved another system with the same coefficient matrix and with b = e. Now a; is a “large- 
normed” solution, that is, ||x||oo = 1 ||oo ||^||oo ) (indeed, ||x||oo = ll^” 1 ||oo||6||oo since A~ l > 

0). For this right-hand side the instability in the block LU factorization does not affect x for 
implementation 1: res(x) < 5 x 10” 19 for all block sizes. In our experience this behaviour is not 
uncommon for large-normed solutions. 

Table 5.3 reports results for the Dorr matrix £>i6(10” 4 ), for implementation 1 with x t - = i. 
In computing the err(-) quantities for the Dorr matrix we approximated the true solution by the 
computed solution from Gaussian elimination with partial pivoting. The results for implementation 2 
are very similar. We see more severe instability than for the Moler matrix. One step of iterative 
refinement is not sufficient to achieve a residual of order u. It is surprising that despite the instability 
evident for the block size 15, the magnitude of the error err(x) indicates that x is about as accurate 
as the solution from GEPP. For the block size 15 with implementation 1, 

||£|UI|£||oo « 8 x 10 15 * 0.4#eoo(i4)||i4||oo. 
confirming that (3.7) is reasonably sharp. 

We also solved the Dorr matrix system with b = e and found the results to be very similar to 
those in Table 5.3. Thus, although x is now a large-normed solution, the instability in the LU 
factorization is still fully reflected in x. We solved the same systems with the transpose of the Dorr 
matrix, which is diagonally dominant by columns. All the relative residuals for Implementation 1 
were less than 3u. Implementation 2 behaved erratically: for the system with x t - = z, res(£, U) < 3 u 
but res(x) was as large as 5 x 10~ 4 ! In this example ||£||oo||^||oo/||A||oo was approximately equal to 
the block size r, so the 2 r “ 2 bound in (3.10) is pessimistic here. 
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We conclude from these experiments that our backward error bounds for implementation 1 of 
Algorithm BLU are almost attainable and they seem to capture well the behaviour of the backward 
error. We have also observed some varied and interesting behaviour, all of which is within the 
freedom afforded by the error bounds, but not all of which is easily explained heuristically. 
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Table 5.1: Moler matrix. Implementation 1. x = e. 


*oo(^16(-2)) % 7 X 10 16 


block size 

res(X, U) 

bound* 

res(J) 

res(y) 

err(x) 

err(y) 

i 

0.00 

2.34e-16 

0.00 

0.00 

0.00 

0.00 

2 

0.00 

4.87e-16 

0.00 

0.00 

0.00 

0.00 

3 

6.64e-17 

2.91e-15 

6.27e-17 

0.00 

6.02e-2 

6.03e-2 

4 

2.56e-16 

8.41e-15 

1.24e-16 

0.00 

2.75e-2 

4.79e-2 

5 

3.67e-16 

3.39e-14 

3.14e-16 

6.27e-17 

3.15e-2 

1.31e-l 

6 

1.18e-15 

8.35e-14 

3.14e-16 

0.00 

3.46e-2 

4.67e-2 

7 

4.09e-15 

2.93e-13 

2.70e-15 

0.00 

2.24e-2 

5.40e-2 

8 

1.66e-15 

4.98e-13 

2.38e-15 

0.00 

8.04e-3 

5.89e-2 

9 

1.02e-14 

1.65e-12 

5.58e-15 

6.12e-17 

2.57e-2 

5.02e-2 

10 

1.14e-13 

5.35e-12 

6.30e-14 

0.00 

1 .2 le- 1 

7.74e-3 

11 

1.56e-13 

1.71e-ll 

8.65e-14 

0.00 

4.81e-2 

1.28e-l 

12 

7.63e-13 

5.38e-ll 

1.13e-13 

0.00 

9.31e-2 

9.04e-2 

13 

3.89e-13 

1.68e-10 

1.24e-12 

5.94e-17 

4.03e-3 

1.13e-l 

14 

1.71e-12 

5.17e-10 

3.55e-12 

3.92e-18 

1.64e-2 

2.88e-2 

15 

2.95e-ll 

1.58e-9 

1.32e-ll 

0.00 

1.17e-l 

3.20e-2 


Table 5.2: Moler matrix. Implementation 2. x = e. 
*ooOM-2)) «7x 10 16 


block size 

res(L, U) 

bound2 

res(J) 

res(y) 

err(J) 

err(y) 

1 

0.00 

2.34e-16 

0.00 

0.00 

0.00 

0.00 

2 

0.00 

2.39e-14 

0.00 

0.00 

0.00 

0.00 

3 

3.36e-16 

2.31e-12 

6.27e-17 

6.18e-17 

2.58e-2 

3.10e-2 

4 

2.40e-15 

1.06e-10 

6.27e-16 

0.00 

3.97e-l 

2.08e-l 

5 

1.47e-14 

6.17e-9 

2.57e-15 

2.90e-17 

1.38e0 

1.63e-l 

6 

1.13e-13 

2.04e-7 

8.32e-15 

9.41e-17 

5.32e0 

1.07e-l 

7 

5.95e-13 

9. Ole-6 

3.11e-13 

6.12e-17 

1.72e0 

4.87e-2 

8 

1.83e-ll 

1.85e-4 

1.34e-13 

3.78e-14 

1.69e2 

1.24e0 

9 

1.18e-10 

7.07e-3 

3.20e-13 

1.28e-13 

3.39e2 

5.8 le- 1 

10 

4.82e-10 

2.59e-l 

1.45e-ll 

8.37e-14 

6.52el 

9.32e-2 

11 

1.93e-8 

9.15e0 

2.82e-12 

2.30e-12 

6.79e3 

4.67el 

12 

l.lle-7 

3.13e2 

8.08e-12 

5.89e-12 

1.25e4 

5.00el 

13 

1.54e-6 

1.04e4 

2.69e-ll 

2.19e-ll 

6.32e4 

3.90e2 

14 

8.49e-6 

3.38e5 

1.38e-10 

2.4 7e-10 

8.65e4 

l.06e2 

15 

1.13e-4 

1.08e7 

2.55e-10 

2.04e-10 

4.51e5 

2.39e3 
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Table 5.3: Dorr matrix. Implementation l. x = (1, 2, . . . , n) T . 
*co(£>i6(10- 4 )) » 2 x 10 15 


block size 

res(L, U) 

boundi 

res(9) 

res( 3 /) 

err(J) 

err(y) 

i 

5.89e-17 

2.49e-14 

6.76e-17 

2.14e-17 

2.10e-4 

5.72e-4 

2 

2.92e-14 

4.77e-12 

5.67e-15 

2.85e-17 

1.84e-4 

7.37e-4 

3 

1.06e-12 

3.38e-10 

7.65e-13 

5.09e-16 

1.49e-4 

3.77e-4 

4 

7.11e-12 

1.00e-8 

6.03e-12 

6.95e-15 

1.91e-4 

5. 01e-4 

5 

7.59e-9 

6.48e-6 

8.87e-10 

7.83e-13 

2.37e-4 

6.06e-4 

6 

1.51e-10 

2.00e-8 

8.35e-12 

2.31e-15 

7.18e-5 

4.29e-4 

7 

1.42e-5 

8.48e-4 

3.79e-7 

9.95e-ll 

7.64e-5 

4.29e-4 

8 

1.28e-19 

2.24e-16 

2.49e-17 

2.85e-17 

1.39e-4 

5.51e-4 

9 

9.98e-17 

1.20e-14 

2.49e-17 

6.05e-17 

1.47e-4 

3.88e-4 

10 

3.67e-14 

1.05e-12 

1.41e-15 

1.42e-17 

2.37e-4 

6.10e-4 

11 

1.54e-12 

1.28e-10 

7.68e-14 

2.67e-17 

1.03e-4 

7.44e-4 

12 

1.52e-10 

2.00e-8 

8.94e-12 

3.17e-15 

7.68e-5 

6.12e-4 

13 

3.10e-8 

3.83e-6 

2.23e-9 

7.44e-13 

1.00e-4 

7.41e-4 

14 

8.98e-6 

8.56e-4 

4.39e-7 

1.37e-10 

8.84e-5 

6.04e-4 

15 

5.44e-4 

6.07e-2 

2.30e-5 

5.90e-9 

6.51e-5 

4.04e-4 
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6 Concluding Remarks 


The main conclusions of this work are that although block LU factorization is unstable in general, 
it is stable for matrices that are block diagonally dominant by columns, and generally the level 
of instability is bounded in terms of the condition number and the growth factor for Gaussian 
elimination without pivoting. Therefore if the matrix is symmetric positive definite or (point) 
diagonally dominant, stability is assured if A is well-conditioned. These results are summarised in 
Table 6.1, which tabulates a bound for ||A— LU\\/(c n u\\A\\) for block and point LU factorization with 
the matrix properties considered in sections 3 and 4. The constant c n incorporates any constants 
in the bound that depend poly normally on the dimension, so a value of 1 in the table indicates 
unconditional stability. 

The implications for practical computation are that when using block LU factorization to solve 
Ax = b (which we certainly do not discourage) it is vital to check the relative residual (or normwise 
backward error) ||Ax — 6||c»/(||A||oo||£||oo + Halloo)* If the residual is unacceptably large it is worth 
trying one step of iterative refinement in fixed precision, although this is not guaranteed to yield 
a smaller residual if the instability is severe. Note that one may be fortunate enough to obtain an 
acceptable x even if || A — Zi/||oo/||^||oo is large, as our numerical experiments illustrate. 

A more general conclusion is that the stability of a block algorithm can not be taken for granted. 
Existing error analysis for point algorithms is not directly applicable to block algorithms; it is, 
however, applicable to partitioned algorithms. A complicating feature is that there may be several 
possible block reformulations of a basic algorithm to consider, as is the case with Algorithm BLU in 
section 2. Assessing the stability of other block algorithms is clearly an interesting area for further 
research. 
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Table 6.1: Stability of LU factorization. 
p n is the growth factor for GE without pivoting, 
r is the maximum block size. 


Matrix property 

Block LU 

Point LU 

symmetric positive definite 

K (A) 1 ' 2 

i 

block column diag. dom. 

1 

p n li(A) 

point column diag. dom. 

2 r-2 

1 

block row diag. dom. 


PnK(A) 

point row diag. dom. 

k(A) 

1 

arbitrary 

pI*(A) 

Pnli(A) 
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